Reading in data

You can import almost any data format into R if you know the right function (and package):

csv files

Last week we read in a csv file saved in the data folder within our project folder using the function read_csv("filepath/filename") from readr (part of the tidyverse). That is, we read the data locally, from a file on our own computer.

  • read_csv() will guess the variable type of each column. If we want more control over how the data is read in, we can tell R the type of each column with the col_types argument, e.g., col_types = "cfin", where each letter in the string is shorthand for a variable type and the letters are given in the same order as the columns they apply to

    • “c” = character
    • “f” = factor
    • “i” = integer
    • “n” = number
    • “d” = double
    • “l” = logical
  • We can also read a csv file directly from a URL, without first saving it to the computer. When the data can be read from a URL, the script is easier for others to reproduce (assuming the link doesn’t disappear!).

    • read_csv("url")
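
A minimal sketch of the col_types shorthand, assuming the readr package is installed. The file contents and column names (name, group, count, amount) are made up for illustration, and the file is written to a temporary location so the example runs on its own.

```r
library(readr)

# Write a small, made-up csv file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("name,group,count,amount",
    "a,x,1,2.5",
    "b,y,2,3.0",
    "c,x,3,4.75"),
  csv_path
)

# "cfin" = character, factor, integer, number, in column order
dat <- read_csv(csv_path, col_types = "cfin")

# Check which type each column was read as
sapply(dat, class)
```

Reading from a URL should look the same, with the URL string in place of csv_path.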

Excel files

csv files are simple text files, which makes them easy to read. Data are often stored and shared in Excel files, which are harder to read. The readxl package makes this easier.

  • Function: read_excel("filepath/filename.xlsx")
  • Argument: sheet = 1 reads in the first sheet; if the sheets are named, you can also select a sheet by name, e.g., sheet = "sheetname"
  • Argument: skip = 1 skips the first row; Excel spreadsheets are often written for humans rather than computers, so extra header rows above the data are common
  • Argument: range = "B3:C100" reads in only the cells in the given range
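
The arguments above can be tried out on the example workbooks that ship with readxl. This is a sketch, assuming readxl is installed; datasets.xlsx and deaths.xlsx are readxl's own demo files, not course data, and skip = 4 is specific to the layout of deaths.xlsx.

```r
library(readxl)

# Example workbooks bundled with readxl (not course data)
datasets_path <- readxl_example("datasets.xlsx")
deaths_path   <- readxl_example("deaths.xlsx")

# List the sheet names so we know what we can ask for
excel_sheets(datasets_path)

# Read a sheet by position or by name
first_sheet  <- read_excel(datasets_path, sheet = 1)
mtcars_sheet <- read_excel(datasets_path, sheet = "mtcars")

# Read only a rectangle of cells; the first row of the range becomes the header
corner <- read_excel(datasets_path, sheet = 1, range = "A1:C5")

# Skip the explanatory rows that sit above the real header row
deaths <- read_excel(deaths_path, skip = 4)
```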

read_excel() cannot read an Excel file directly from a URL. The file must first be downloaded to your computer and then read in locally. (You can also download csv files first and read them in locally.)

  • download.file("url", "destfolder/filename.xlsx", mode = "wb") (mode = "wb" keeps binary files such as .xlsx intact on Windows)

If an Excel file has a download link, downloading the file from within the script is more reproducible. To ensure such a script works for anyone, you can include a code snippet that creates the download folder if it does not already exist, e.g.,

if (!dir.exists("destfolder")){
  dir.create("destfolder")
}
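
One way to put the folder check and the download together is a small helper that only downloads when the file is not already there. This is a sketch, not a standard function: download_if_missing() is a hypothetical name, and a local file served through a file:// URL stands in for a real download link so the example runs without an internet connection.

```r
# Hypothetical helper: download a file only if we don't already have it
download_if_missing <- function(url, destfile) {
  destfolder <- dirname(destfile)
  if (!dir.exists(destfolder)) {
    dir.create(destfolder, recursive = TRUE)
  }
  if (!file.exists(destfile)) {
    # mode = "wb" keeps binary files such as .xlsx intact on Windows
    download.file(url, destfile, mode = "wb")
  }
  invisible(destfile)
}

# Stand-in for a real download link: a local file behind a file:// URL
source_file <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2"), source_file)
demo_url <- paste0(
  "file://",
  if (.Platform$OS.type == "windows") "/" else "",
  normalizePath(source_file, winslash = "/")
)

dest <- file.path(tempdir(), "destfolder", "filename.csv")
download_if_missing(demo_url, dest)
file.exists(dest)
```

Skipping the download when the file already exists keeps repeated runs of the script fast and avoids depending on the link every time.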

More dplyr

dplyr cheatsheet!

\(\color{green}{\text{arrange()}}\) - reorder \(\color{green}{\text{rows}}\)

  • Reverse the order (largest to smallest) with desc()
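
A short sketch of arrange() and desc(), assuming dplyr is installed. The scores data frame and its columns are made up for illustration.

```r
library(dplyr)

# Made-up data for illustration
scores <- tibble(
  student = c("Ana", "Ben", "Cal", "Dee"),
  section = c("B", "A", "B", "A"),
  score   = c(88, 92, 75, 81)
)

# Smallest to largest (the default)
by_score <- arrange(scores, score)

# Largest to smallest
by_score_desc <- arrange(scores, desc(score))

# Sort by section first, then by descending score within each section
by_section <- arrange(scores, section, desc(score))
```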

\(\color{blue}{\text{summarize()}}\) - summarize \(\color{blue}{\text{variables}}\)

Collapse a data frame into a single row (or one row per group) of summary values, each computed with a summary function.

Summary functions include:

  • first(): first value
  • last(): last value
  • nth(.x, n): nth value
  • min(): minimum value
  • max(): maximum value
  • median(): median value
  • quantile(.x, probs = .25): value at the given quantile (here the 25th percentile)
  • sum(): sum of values
  • n(): number of values
  • n_distinct(): number of distinct values
  • mean(): mean value
  • var(): variance
  • sd(): standard deviation
  • IQR(): interquartile range

Things to note:

  • multiple summary functions can be called within the same command
  • we can give the summary values new names (though we don’t have to)
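
Both points can be seen in a small sketch, assuming dplyr is installed. The scores data and the new column names (n_students, mean_score, and so on) are made up for illustration.

```r
library(dplyr)

# Made-up data for illustration
scores <- tibble(
  section = c("A", "A", "B", "B", "B"),
  score   = c(90, 80, 70, 85, 100)
)

# Several summary functions in one call, each given a new name
overall <- scores %>%
  summarize(
    n_students = n(),
    mean_score = mean(score),
    max_score  = max(score),
    n_sections = n_distinct(section)
  )

# Without a name, the column is named after the expression itself
unnamed <- scores %>%
  summarize(mean(score))
```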

summarize() is especially helpful when combined with group_by().

\(\color{green}{\text{group_by()}}\) - group \(\color{green}{\text{rows}}\)

Aggregate/group by value(s) of column(s).

  • we can group by more than one variable at once
  • we can perform other operations after group_by as well, like mutate
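
A sketch of grouping by one variable, by two variables, and of following group_by() with mutate(), assuming dplyr is installed. The sales data and column names are made up for illustration.

```r
library(dplyr)

# Made-up data for illustration
sales <- tibble(
  region = c("East", "East", "West", "West", "West"),
  year   = c(2020, 2021, 2020, 2021, 2021),
  amount = c(10, 20, 5, 15, 30)
)

# One row per region
by_region <- sales %>%
  group_by(region) %>%
  summarize(total = sum(amount), n_sales = n())

# One row per region-year combination
by_region_year <- sales %>%
  group_by(region, year) %>%
  summarize(total = sum(amount), .groups = "drop")

# group_by() then mutate(): keep every row and add each sale's share
# of its region's total
with_share <- sales %>%
  group_by(region) %>%
  mutate(share = amount / sum(amount)) %>%
  ungroup()
```

Ending with ungroup() (or .groups = "drop") returns an ungrouped data frame, so later steps are not applied group by group by accident.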

\(\color{blue}{\text{mutate()}}\) - create new \(\color{blue}{\text{variables}}\)

Create new columns or alter existing columns

  • we can mutate new variables as functions of other variables (ratios, conditions, ranks, etc.)
  • we can mutate multiple variables in the same command
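  • we can mutate based on conditions with if_else() or case_when():
# if_else() takes a condition, the value if TRUE, the value if FALSE,
# and (optionally) the value to use where the condition is NA
df <- df %>% 
  mutate(newvar = if_else(condition, value_if_true, value_if_false, value_if_na))

# case_when() checks the conditions in order; TRUE catches all remaining rows
df <- df %>% 
  mutate(newvar = case_when(
    condition1 ~ value1, 
    condition2 ~ value2, 
    condition3 ~ value3, 
    TRUE ~ value_everything_else
  ))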
  • \(\color{blue}{\text{summarize(across())}}\) - apply summary function to select \(\color{blue}{\text{variables}}\)
  • \(\color{blue}{\text{summarize(across(where()))}}\) - apply summary function to \(\color{blue}{\text{variables}}\) by conditions
  • across() can also be used within mutate
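
The conditional mutate patterns and across() can be tried on a small made-up data frame. This is a sketch, assuming dplyr 1.0 or later is installed; the measurements data, the cut-off values, and the new column names are invented for illustration.

```r
library(dplyr)

# Made-up data for illustration (one missing height)
measurements <- tibble(
  id     = c("a", "b", "c", "d"),
  height = c(150, 165, NA, 182),
  weight = c(50, 62, 70, 90)
)

labelled <- measurements %>%
  mutate(
    # if_else(condition, true, false, missing)
    tall = if_else(height > 160, "yes", "no", "unknown"),
    # case_when() checks conditions in order; TRUE catches everything else
    weight_class = case_when(
      weight < 55 ~ "light",
      weight < 75 ~ "medium",
      TRUE        ~ "heavy"
    )
  )

# summarize(across()): apply a summary function to columns chosen by name
means_named <- measurements %>%
  summarize(across(c(height, weight), ~ mean(.x, na.rm = TRUE)))

# summarize(across(where())): choose columns by a condition (here: numeric)
means_numeric <- measurements %>%
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

# across() inside mutate(): transform every numeric column in place
rescaled <- measurements %>%
  mutate(across(where(is.numeric), ~ .x / 100))
```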

Let’s Play with R!

First go to Slack and copy the practice script for today (week2script.R) into your weeklymaterials/scripts folder from last week. Then open an RStudio session using the weeklymaterials.Rproj file.